Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The quality distribution looks approximately normal. Most wines are rated 5 or 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed.acidity distribution is somewhat right-skewed, but there are no extreme outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

At first I thought volatile.acidity was simply normally distributed, but when I looked at the data with a small binwidth, it turned out to be bimodal.
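The effect of bin width can be checked without drawing anything. The sketch below uses synthetic two-component data as a stand-in for volatile.acidity (the numbers are made up, and `bin_counts` is a hypothetical helper, not part of the original analysis) and compares the bin counts for a wide and a narrow bin width.

```r
# Synthetic stand-in for volatile.acidity: a mixture of two normal components.
# Illustrative only; these are not the wine measurements.
set.seed(1)
x <- c(rnorm(500, mean = 0.40, sd = 0.05), rnorm(500, mean = 0.65, sd = 0.05))

# Hypothetical helper: count observations per bin for a given bin width,
# without drawing a plot.
bin_counts <- function(values, binwidth) {
  lo <- floor(min(values) / binwidth)
  hi <- ceiling(max(values) / binwidth)
  breaks <- (lo:hi) * binwidth
  hist(values, breaks = breaks, plot = FALSE)$counts
}

coarse <- bin_counts(x, 0.25)  # a few wide bins: the dip between peaks is hidden
fine   <- bin_counts(x, 0.02)  # many narrow bins: the dip between peaks can show

print(coarse)
print(fine)
```

With narrow bins, a run of low counts between two runs of high counts is the signature of a bimodal shape.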

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric.acid distribution is right-skewed. I tried transforming the data with scale_x_sqrt or scale_x_log10, but neither improved the shape much.
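One possible reason the log scale does not help here is that citric.acid contains exact zeros (its minimum in the summary above is 0), and log10(0) is -Inf. The following sketch illustrates this on synthetic data; the data and the `skewness` helper are assumptions made for illustration only.

```r
# Synthetic right-skewed values that include exact zeros. Illustrative only.
set.seed(2)
x <- c(rep(0, 50), rexp(450, rate = 4))

# Hypothetical helper: sample skewness (third central moment over the
# second central moment raised to the power 1.5), ignoring non-finite values.
skewness <- function(v) {
  v <- v[is.finite(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

sqrt_x  <- sqrt(x)   # defined at zero
log10_x <- log10(x)  # log10(0) is -Inf, so zeros cannot sit on a log axis

cat("skewness raw: ", skewness(x), "\n")
cat("skewness sqrt:", skewness(sqrt_x), "\n")
cat("non-finite values after log10:", sum(!is.finite(log10_x)), "\n")
```

A square root transform keeps every observation, while a log transform drops the zeros unless a small offset is added first.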

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

To take a closer look, I focused on residual.sugar values from 1.0 to 4.0. The distribution is right-skewed and has many outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] 0.0470653

Chlorides are concentrated around 0.08 and have a very small standard deviation, 0.047.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Since the original histogram of total sulfur is right-skewed, I used the scale_x_sqrt function to make the data easier to read.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
## [1] 0.001887334

Density is normally distributed. One thing I want to keep in mind is that it has a very small standard deviation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is normally distributed. Its median and mean are almost equal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates distribution is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is smoothly right-skewed.

Univariate Analysis

What is the structure of your dataset? There are 1599 wines with 12 variables. I deleted the X column and the original quality column, and created a categorical data$quality column.
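A minimal sketch of this kind of preparation is shown below. It assumes a data frame with the same column names as the wine data; the tiny `wine` data frame is made up, and the use of an ordered factor is an assumption, since the original code is not shown.

```r
# Tiny made-up data frame with the same column names as the wine data.
wine <- data.frame(
  X = 1:6,
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.8),
  quality = c(5L, 5L, 6L, 7L, 3L, 8L)
)

# Drop the row-index column, which carries no information about the wine.
wine$X <- NULL

# Replace the integer score with a categorical (here: ordered) factor.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

str(wine)
print(table(wine$quality))
```

Fixing the levels to 3 through 8 keeps empty categories visible in tables and plots even when a subset of the data lacks some scores.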

What is/are the main feature(s) of interest in your dataset? From the description of the data set, I suspect volatile acidity, residual sugar, and chlorides, since these factors seem to directly affect the taste of wine. I'd like to determine which features are best for predicting the quality of a wine.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? I applied a square root transformation to the right-skewed total sulfur distribution. The transformed distribution looks closer to a normal distribution.

Bivariate Plots Section

According to the correlation matrix, quality seems to correlate with volatile.acidity, density, sulphates, and alcohol. I want to take a closer look at the scatter plots between them.
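A correlation matrix like the one referred to here can be computed with `cor()`. The sketch below runs on synthetic data with a few of the same column names (the generating formula is invented for illustration) and ranks the other columns by the strength of their correlation with quality.

```r
# Synthetic data with some of the wine column names. Illustrative only.
set.seed(3)
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- round(5.6 + 0.3 * (alcohol - 10.4) -
                 1.4 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.6))
wine <- data.frame(alcohol, volatile.acidity, quality)

# Pearson correlations between all numeric columns (the default method).
cors <- cor(wine)

# Correlation of every other column with quality, strongest first.
with_quality <- cors[, "quality"]
with_quality <- with_quality[names(with_quality) != "quality"]
print(round(with_quality[order(-abs(with_quality))], 3))
```

`cor()` needs numeric input, so quality has to be numeric (not a factor) at this step.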

There appears to be a negative correlation.

Standard deviation of density by quality

## data$quality: 3
## [1] 0.002001845
## -------------------------------------------------------- 
## data$quality: 4
## [1] 0.001575169
## -------------------------------------------------------- 
## data$quality: 5
## [1] 0.001588504
## -------------------------------------------------------- 
## data$quality: 6
## [1] 0.002000009
## -------------------------------------------------------- 
## data$quality: 7
## [1] 0.002175739
## -------------------------------------------------------- 
## data$quality: 8
## [1] 0.002378276

The density boxplots largely overlap across quality levels.
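The per-quality output above has the layout that `by()` prints. A sketch of how such a table can be produced, assuming a factor `quality` column and using synthetic density values:

```r
# Synthetic stand-in data. Illustrative only.
set.seed(4)
wine <- data.frame(
  quality = factor(sample(3:8, 300, replace = TRUE)),
  density = rnorm(300, mean = 0.9967, sd = 0.0019)
)

# Standard deviation of density within each quality level.
sd_by_quality <- by(wine$density, wine$quality, sd)
print(sd_by_quality)

# tapply returns the same numbers as a plain named vector,
# which is easier to reuse in later calculations.
sd_vec <- tapply(wine$density, wine$quality, sd)
print(round(sd_vec, 5))
```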

library(ggplot2)
# Sulphates by quality: jittered points with boxplots drawn on top
ggplot(data, aes(x = quality, y = sulphates)) +
  geom_jitter(alpha = 0.3) +
  geom_boxplot()

Standard deviation of sulphates by quality

## data$quality: 3
## [1] 0.12202
## -------------------------------------------------------- 
## data$quality: 4
## [1] 0.239391
## -------------------------------------------------------- 
## data$quality: 5
## [1] 0.1710623
## -------------------------------------------------------- 
## data$quality: 6
## [1] 0.1586495
## -------------------------------------------------------- 
## data$quality: 7
## [1] 0.1356389
## -------------------------------------------------------- 
## data$quality: 8
## [1] 0.1153795

The sulphates variable has many outliers in its boxplot.

The density and sulphates variables show a similar pattern in their graphs. Both change across the quality levels, but since their ranges are quite narrow, I am not sure the differences are statistically significant.
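One way to check whether such differences are statistically significant is a Kruskal-Wallis rank sum test, which does not assume normality within groups. This test was not part of the original analysis; the sketch below only shows the mechanics on synthetic sulphates values, reusing the group sizes reported for the quality levels.

```r
# Group sizes taken from the per-quality counts in this report;
# the sulphates values themselves are synthetic and illustrative only.
set.seed(5)
quality <- factor(rep(3:8, times = c(10, 53, 681, 638, 199, 18)))
sulphates <- rnorm(length(quality),
                   mean = 0.55 + 0.03 * as.integer(quality),
                   sd = 0.15)
wine <- data.frame(quality, sulphates)

# Null hypothesis: sulphates has the same distribution in every quality group.
result <- kruskal.test(sulphates ~ quality, data = wine)
print(result)
```

A small p-value would indicate that at least one quality group differs; it does not say which one, and with group sizes as unequal as these the smallest groups contribute little.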

## data$quality: 3
## [1] 10
## -------------------------------------------------------- 
## data$quality: 4
## [1] 53
## -------------------------------------------------------- 
## data$quality: 5
## [1] 681
## -------------------------------------------------------- 
## data$quality: 6
## [1] 638
## -------------------------------------------------------- 
## data$quality: 7
## [1] 199
## -------------------------------------------------------- 
## data$quality: 8
## [1] 18

Most alcohol values belong to wines of quality 5 and 6. Even though quality 5 has some outliers, there appears to be a positive correlation.

The points are concentrated at low alcohol and relatively low volatile.acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? Alcohol and volatile.acidity correlate with quality.

On the other hand, density and sulphates take relatively similar values across quality levels.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? Density and fixed.acidity show a correlation of 0.668. Alcohol and density show a correlation of -0.496.

What was the strongest relationship you found? Alcohol and volatile.acidity show relatively strong correlations with wine quality.

Multivariate Plots Section

You can see the color changing from the upper left to the lower right. In addition, the regression lines suggest that volatile.acidity is a more important factor than alcohol for predicting wine quality: many of the colored regression lines are horizontal, which means the points of each quality sit at a roughly constant volatile.acidity. In this graph, the lower the volatile.acidity, the better the quality.
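The "horizontal line" reading can be made numeric by fitting one regression per quality level and looking at the slopes. The sketch below assumes alcohol on the x axis and volatile.acidity on the y axis, which is a guess about the plot, and uses synthetic data in which volatile.acidity depends on quality but not on alcohol.

```r
# Synthetic data: acidity level depends on quality, not on alcohol.
# Illustrative only; the axis assignment is an assumption about the plot.
set.seed(6)
n <- 300
wine <- data.frame(
  quality = factor(sample(4:7, n, replace = TRUE)),
  alcohol = runif(n, min = 8.5, max = 14)
)
score <- as.integer(as.character(wine$quality))
wine$volatile.acidity <- 0.9 - 0.08 * score + rnorm(n, sd = 0.1)

# One simple regression per quality level; keep only the alcohol slope.
slopes <- sapply(split(wine, wine$quality), function(g) {
  coef(lm(volatile.acidity ~ alcohol, data = g))[["alcohol"]]
})
print(round(slopes, 3))
```

Slopes near zero correspond to horizontal lines: within a quality level, volatile.acidity does not move with alcohol, while the level of each line still differs between qualities.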

I wondered whether volatile.acidity is correlated with density and sulphates, so I made the same kind of graph with each of them.

First, a scatter plot of volatile.acidity and density, colored by quality.

Since many regression lines are horizontal, volatile.acidity has a stronger correlation with quality than density does.

Next, I made the same scatter plot with sulphates.

Although the regression line for quality 3 is nearly vertical, the other lines are almost horizontal. Again, volatile.acidity appears to be more strongly related to quality than sulphates.

I changed quality from a factor to an integer in order to run a multiple regression.

## 
## Call:
## lm(formula = data$quality ~ data$volatile.acidity + data$alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59342 -0.40416 -0.07426  0.46539  2.25809 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.09547    0.18450   16.78   <2e-16 ***
## data$volatile.acidity -1.38364    0.09527  -14.52   <2e-16 ***
## data$alcohol           0.31381    0.01601   19.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.3161 
## F-statistic: 370.4 on 2 and 1596 DF,  p-value: < 2.2e-16

The p-values of both independent variables are small (<2e-16). Adjusted R-squared is 0.3161, so about 32% of the variance in quality is explained by this model.
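Converting a factor back to numbers has a known pitfall in R: `as.integer()` on a factor returns the internal level codes, not the labels. The sketch below shows the difference and then fits the same kind of two-predictor model on synthetic data, passing `data =` to `lm()` so that the coefficient names stay short. The data and coefficients are made up for illustration.

```r
# Factor-to-integer conversion.
q_factor <- factor(c(5, 5, 6, 7, 3, 8, 4, 6))
codes  <- as.integer(q_factor)                # level codes 1..6, not the scores
scores <- as.integer(as.character(q_factor))  # the original scores 3..8
print(codes)
print(scores)

# Multiple regression on synthetic data. Illustrative only.
set.seed(7)
n <- 200
wine <- data.frame(
  volatile.acidity = runif(n, min = 0.2, max = 1.0),
  alcohol = runif(n, min = 9, max = 13)
)
wine$quality <- round(3 - 1.4 * wine$volatile.acidity +
                      0.3 * wine$alcohol + rnorm(n, sd = 0.7))

fit <- lm(quality ~ volatile.acidity + alcohol, data = wine)
adj_r2 <- summary(fit)$adj.r.squared
print(coef(fit))
print(adj_r2)
```

Using the level codes by mistake would shift every score by a constant here, which changes the intercept but not the slopes; with non-consecutive labels it would distort the slopes as well.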

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I compared volatile.acidity with three other features: alcohol, density, and sulphates. In each comparison, volatile.acidity showed a stronger correlation with wine quality than the other feature.

Were there any interesting or surprising interactions between features?

Looking at the scatter plots alone did not tell me much, but adding regression lines made them much more meaningful. I learned how important it is to look at data from different points of view.

Final Plots and Summary

First, from the correlation matrix, I chose two variables, volatile.acidity and alcohol, since each of them appeared to correlate with quality in its scatterplot. I then decided to take a closer look at those variables using colored boxplots. These boxplots are Plot One and Plot Two.

Although both graphs show a clear trend, the boxplots suggest that volatile.acidity is the better variable for predicting quality.

Therefore, I wanted a graph that includes all three variables at once. I made a scatterplot colored by quality and added a regression line for each color.

Surprisingly, the regression lines make it clear that the correlation between volatile.acidity and quality is stronger than the one between alcohol and quality, because most of the regression lines are drawn horizontally.

Plot One

I chose this plot simply because it reflects my idea that there is a correlation between volatile.acidity and quality. I also thought this colored boxplot would easily convey how volatile.acidity affects the quality of wine: the lower the volatile.acidity, the better the quality.

Plot Two

I chose this plot for almost the same reason as Plot One. Although the spread within each boxplot is clearly larger, wine quality still tends to improve as alcohol content increases.

Plot Three

## data$quality: 3
## [1] 10
## -------------------------------------------------------- 
## data$quality: 4
## [1] 53
## -------------------------------------------------------- 
## data$quality: 5
## [1] 681
## -------------------------------------------------------- 
## data$quality: 6
## [1] 638
## -------------------------------------------------------- 
## data$quality: 7
## [1] 199
## -------------------------------------------------------- 
## data$quality: 8
## [1] 18

After confirming the two correlations with quality in the plots above, I needed to take a closer look at their relationships in one graph. Therefore, I made a scatter plot colored by quality, with regression lines. This graph suggests that volatile.acidity has a stronger correlation with quality than alcohol does.

Also, the regression lines for quality 3 and quality 8 are not horizontal like the other lines, but as you can see, those groups contain only 10 and 18 wines, respectively. With more samples for quality 3 and 8, the regression lines might change.
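The effect of group size on a fitted line can be illustrated with confidence intervals for the slope. The sketch below draws synthetic data with a true slope of zero and compares the interval width for a group of 10 points with that for a group of 681; `slope_ci_width` is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: width of the 95% confidence interval for the slope
# of y ~ x, where y is generated independently of x (true slope is zero).
set.seed(8)
slope_ci_width <- function(n) {
  x <- runif(n, min = 8.5, max = 14)
  y <- 0.5 + rnorm(n, sd = 0.15)
  ci <- confint(lm(y ~ x))["x", ]
  unname(ci[2] - ci[1])
}

width_small <- slope_ci_width(10)   # group size of quality 3
width_large <- slope_ci_width(681)  # group size of quality 5

cat("CI width with n = 10: ", width_small, "\n")
cat("CI width with n = 681:", width_large, "\n")
```

With only a handful of points the slope is poorly determined, so a tilted line in a small group is weak evidence of a real trend.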

Reflection

From my research, I confirmed that there is a correlation between quality and some variables. Plot Three shows a clear pattern: volatile.acidity has a stronger correlation with quality than alcohol. In addition, the multiple regression based on Plot Three has an adjusted R-squared of 0.3161.

To be honest, before drawing regression lines in Plot Three, I did not think I was doing well, since the correlations between quality and the other variables looked really weak. However, after adding regression lines to the scatter plot, I had to change my mind, because the result clearly showed a pattern. Through this course, I learned how to look at data from various points of view. In the future, I would like to learn more ways of looking at data so that I do not miss important points.